A semantic segmentation model for road cracks combining channel-space convolution and frequency feature aggregation

In transportation, roads can develop cracks due to overloading and other causes, which seriously affects driving safety, so it is crucial to identify and fill road cracks in time. Existing semantic segmentation models perform poorly on road crack images: standard convolution struggles to capture the coupled spatial and channel relationships between pixels, and crack pixels are difficult to distinguish from background pixels in complex scenes. To address these defects, this paper proposes a semantic segmentation model for road cracks that combines channel-space convolution with frequency feature aggregation. A new convolutional block is proposed that identifies crack pixels accurately by combining grouped spatial displacement with dynamic convolution kernel weights, modeling pixel spatial relationships together with channel features. To enhance the contrast of crack edges, a frequency domain feature aggregation module is proposed, which uses a simple windowing strategy to solve the mismatch of frequency domain inputs and takes into account the effect of the imaginary part of the frequency on the features, so that deep frequency features are modeled effectively. Finally, a feature refinement module is designed to refine the semantic features and improve segmentation accuracy. Extensive experiments show that the proposed model has better performance and more application potential than current popular general-purpose models.

width variation of cracks. Benedetto et al. 9 proposed a road crack segmentation algorithm based on Unet that focuses on the width variation of segmented cracks.
The above networks achieved good results in the road crack segmentation task but were relatively computationally complex. Qi et al. 10 proposed GMDNet to handle discontinuous and irregular cracks in road crack segmentation. It introduces GhostNet as the backbone network, enhances the feature extraction capability with dynamic convolution, and keeps the computational cost low. Pang et al. 11 proposed DcsNet, a real-time deep network for crack segmentation, which retains morphological information with scale invariance and constructs a small-stride shallow detail branch to supplement detailed information. However, lightweight networks often sacrifice accuracy for efficiency, which reduces their generalization ability.
In this paper, focusing on accurately segmenting road cracks against the complex background of concrete road grooves, tire marks, lane lines, etc., we propose a semantic segmentation model for road cracks that combines channel-space convolution with frequency feature aggregation. First, a novel convolutional block based on the residual idea is designed to construct the backbone network, which perceives crack feature information during gradual downsampling. Second, a frequency feature aggregation module is added in the deep part of the network to aggregate the crack body and edge information, respectively, and so enhance the network's ability to recognize crack pixels. Finally, to address feature redundancy and the loss of critical features in multi-scale feature fusion, a multi-layer perceptron is first used to generate weight matrices that aggregate the multi-dimensional features, and an attention feature refinement module is then designed to refine these features and further distinguish the background. We evaluate the proposed model on a self-built dataset and on open source datasets. The experiments show that the proposed model distinguishes the background from the cracks more accurately, which offers a new approach to the automatic detection of cracks. The main contributions of this paper are as follows.
(1) A new convolutional block is designed for crack feature extraction that reduces convolution parameters while maintaining performance.
(2) A frequency feature aggregation module is designed for crack characteristics. It uses a windowing method to solve the problem of input mismatch in the frequency domain while considering the complex frequency of the features. Modeling the deep feature frequencies with channel attention alleviates the difficulty of distinguishing similar pixels in the time domain and enhances the crack edge features.
(3) An attention feature refinement module is proposed to reduce redundancy in multi-scale feature fusion and to prevent useless features from swamping valuable ones. The crack features are refined to improve the segmentation accuracy.

Related work
This chapter describes work related to the design of deep learning-based road crack segmentation models. For most of this work, the technical route includes data preprocessing, model optimization, and improving segmentation accuracy by incorporating practical application scenarios.

Data preprocessing
Data preprocessing plays a crucial role in deep learning and directly affects the performance and effectiveness of the model. With the application of semantic segmentation models in road crack detection, researchers are often concerned about the quality of crack image acquisition. For example, the model's performance drops dramatically under intense light or high noise. Moreover, road cracks show apparent geographical differences, and different construction materials and construction methods lead to variable road quality. Although there are currently many open source datasets, such as Crack500, Crack Tree, and Crack Forest Dataset, the quality of the datasets varies greatly. Therefore, it is not easy to guarantee the generalization performance of models trained on these datasets.
To ensure the segmentation performance of the model in practice, Ref. 12 proposed a non-local filtering denoising method with fuzzy self-similarity weight estimation to remove various scattering noises from the dataset. Ref. 13 used the non-local self-similarity of learnable depth features to train a regularization layer and effectively reduce the distribution bias of the dataset. Ref. 14 proposed a twin transform network for image denoising to remove noise and artifacts from pictures. Although all of the above networks denoise well, they are not designed for crack images, and their denoising performance on road crack datasets still needs further verification. In engineering practice, random cropping, random rotation, random scaling, and similar methods are usually used to augment the data. However, these methods are too simple and cannot adequately handle the jitter, noise, and ghosting that occur during the acquisition of road crack datasets. Therefore, this paper uses intensity operations 15 to reduce the deviation of the dataset's luminance distribution and smooth the image noise, while extending the dataset by combining the above methods. Data augmentation enhances the model's training process, prevents the model from falling into local optima, and ensures the segmentation accuracy of the model. The image brightness is modified according to Eq. (1), where T denotes the filtered image and ∇x,y T is the derivative of T in the x and y directions. The denominator is a Gaussian kernel of width σ1. Before differentiation in the x or y direction, a Gaussian filter ℑσ2(·) with spatial width σ2 is also applied to T. S is the original image, and the remaining coefficient is the balance parameter.
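The geometric augmentations mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the function name `augment_pair` is hypothetical, only random cropping and rotation by multiples of 90 degrees are shown, and random scaling and the intensity operations are omitted. The point it illustrates is that the image and its label mask must receive the same transform.

```python
import numpy as np

def augment_pair(image, mask, crop_size, rng):
    """Apply one random crop and one random 90-degree rotation to an image and its mask.

    image: (H, W, 3) array, mask: (H, W) array, crop_size: (height, width).
    The crop is assumed square here so that rotation keeps the output shape.
    """
    h, w = mask.shape
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)    # random top-left corner of the crop
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw]
    mask = mask[top:top + ch, left:left + cw]
    k = int(rng.integers(0, 4))          # number of quarter turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return image.copy(), mask.copy()

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
mask = (image[..., 0] > 0.5).astype(np.uint8)   # toy label derived from the image
aug_image, aug_mask = augment_pair(image, mask, (32, 32), rng)
```

Because the toy mask is derived from the image, the pair stays consistent after augmentation, which is a convenient check that both arrays were transformed identically.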

Backbone networks
The backbone network plays a crucial role in neural networks 16. It is responsible for extracting the high-level features from the input data that are critical to the network's final output. However, as the network depth increases, gradients vanish, causing the model performance to degrade drastically. The design of ResNet50 17 has significant advantages in solving the vanishing gradient problem: its residual connections allow the network to be deeper and improve model performance and scalability. ResNet50 performs well in several computer vision tasks and trains well, but it has many parameters. MobileNet V3 18 is a well-designed lightweight network with streamlined parameters and fast computation that aims to deploy neural networks to mobile devices efficiently. Its innovation is the introduction of the lightweight SE (Squeeze and Excitation) 19 attention mechanism, which enhances the expressive power of the network by explicitly modeling the interdependence of the convolutional feature channels. At the same time, the SE attention mechanism strengthens the features that are beneficial to the current task and suppresses those that have less impact on it. GhostNet 20, on the other hand, focuses on the feature map redundancy problem in conventional convolution. Its core idea is to generate richer features with fewer parameters. A set of feature maps is first obtained through a small convolution, and cheap linear transformations are then applied to each of these feature maps to generate Ghost feature maps. Finally, the original and Ghost feature maps are concatenated along the channel dimension to meet the demand for computational efficiency. The backbone networks mentioned above have notable advantages but limited capability in crack feature extraction. Therefore, designing a specialized feature extraction network for road crack segmentation becomes essential. On this basis, this paper designs a channel-space convolution-based backbone network following the residual idea of ResNet50, which focuses on the spatial and channel features of the cracks and markedly enhances the ability to recognize crack pixels.
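As an illustration of the SE attention mechanism described above, the following is a minimal NumPy sketch of the squeeze, excitation, and rescaling steps for a single feature map. The function name `se_attention`, the weight shapes, and the reduction ratio are assumptions made for illustration; biases are omitted, and the sketch uses the plain sigmoid of the original SE formulation rather than any variant used in a particular backbone.

```python
import numpy as np

def se_attention(x, w1, w2):
    """Squeeze-and-Excitation style channel reweighting.

    x:  (C, H, W) feature map
    w1: (C // r, C) weights of the reduction layer (r is the reduction ratio)
    w2: (C, C // r) weights of the expansion layer
    Returns the rescaled feature map and the per-channel scale.
    """
    squeezed = x.mean(axis=(1, 2))                  # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)         # excitation, first layer with ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # excitation, second layer with sigmoid
    return x * scale[:, None, None], scale          # rescale every channel

rng = np.random.default_rng(0)
features = rng.random((8, 5, 5))
w_reduce = rng.standard_normal((2, 8))              # reduction ratio r = 4 (assumed)
w_expand = rng.standard_normal((8, 2))
reweighted, channel_scale = se_attention(features, w_reduce, w_expand)
```

Each channel is multiplied by a value between 0 and 1, so channels judged useful are kept close to their original magnitude while the others are attenuated.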

Methodology
In this paper, a semantic segmentation model for road cracks combining channel-space convolution and frequency feature aggregation is proposed to accurately segment road cracks in the complex context of concrete roads with carved grooves, tire marks, and dirt. The model consists of an encoder, a frequency feature aggregation module, and a decoder, as shown in Fig. 1.

Channel-space convolution
The traditional convolution operation has a limited receptive field: it considers only the local region covered by the convolution kernel and lacks perception of global information. In a deep feature map, different channels differ in importance, and applying the same convolution kernel to all of them introduces many useless features and slows down computation. At the same time, standard convolution is limited in perceiving structural connections in pixel space: it performs a weighted summation of local pixels at each location but does not directly model the spatial structural connections between pixels. To solve these problems, this paper proposes a channel-space convolution (CSConv), as shown in Fig. 2. CSConv is divided into two parallel branches, which process the channel features and the spatial features, respectively. The upper branch first splits the feature map by channel, then uses attention to evaluate the importance of each channel and generate weights, and multiplies these weights with the convolution kernel so that the kernel becomes dynamic. The feature map is then convolved with this kernel, which discards unimportant features and enables the encoder to extract road crack features better. To achieve spatial feature aggregation, the lower branch introduces a parameter-free spatial displacement operation that aligns the feature map with neighboring features along the channel direction. First, the feature maps are grouped, and spatial displacements are applied to the feature channels of different groups to improve the aggregation of neighboring features at the corresponding positions of the convolution. Finally, the features extracted by the two branches are spliced to fuse the channel-space features of the cracks.
The residual convolution block is constructed using CSConv, which can better encode road crack features, as shown in Fig. 3. The improved residual block replaces the original two 3 × 3 standard convolutions with CSConv to enhance the feature extraction capability. Based on the improved residual convolution block, we constructed the backbone network to extract the depth features of road cracks.
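The parameter-free spatial displacement of the lower branch can be illustrated with a short NumPy sketch. The number of groups, the four shift directions, the one-pixel shift distance, and the zero filling at the borders are all assumptions made for illustration, since the text does not specify them; `grouped_spatial_shift` is a hypothetical name.

```python
import numpy as np

def grouped_spatial_shift(x, n_groups=4):
    """Shift each channel group of a (C, H, W) feature map by one pixel.

    Groups are shifted right, left, down, and up in turn (assumed directions).
    Vacated border pixels are filled with zeros. The operation has no
    learnable parameters.
    """
    c, h, w = x.shape
    out = np.zeros_like(x)
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]          # (dy, dx) per group
    groups = np.array_split(np.arange(c), n_groups)      # channel indices per group
    for g, idx in enumerate(groups):
        dy, dx = shifts[g % len(shifts)]
        dst_y = slice(max(dy, 0), h + min(dy, 0))        # where shifted rows land
        src_y = slice(max(-dy, 0), h + min(-dy, 0))      # rows that remain in view
        dst_x = slice(max(dx, 0), w + min(dx, 0))
        src_x = slice(max(-dx, 0), w + min(-dx, 0))
        out[idx, dst_y, dst_x] = x[idx][:, src_y, src_x]
    return out

feature_map = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)
shifted = grouped_spatial_shift(feature_map, n_groups=4)
```

After the shift, a pointwise operation at a given position sees values that originally belonged to neighboring positions in the other groups, which is how neighboring features are aggregated without extra parameters.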

Frequency feature aggregation module
In the time domain, a complex background can blur the boundary between the crack and the background, making crack detection more difficult. Moreover, the time-domain signal mainly captures signal variation in time, while its description of the structural information of the signal in space is limited, so the model is not accurate enough to detect delicate structures such as cracks. To address these problems, this paper proposes the Frequency Feature Aggregation Module (FFAM), as shown in Fig. 4. The module first standardizes the input feature layers to even out the input distribution. To overcome the feature loss caused by frequency input mismatch, the feature map is split uniformly into windows of size N, each of shape S_N = [B, C, N, N]. These windows are then transformed to the frequency domain by a fast Fourier transform. To extract the depth features of the road cracks, both the real and imaginary parts of the complex frequencies are considered, and the frequency features are aggregated using a convolution operation. In the processing of each complex frequency, X1r and X1i denote the real and imaginary parts of the frequency, and Kr and Ki denote the real and imaginary parts of the complex convolution kernel. The road crack features in the spectrum are then modeled using the channel attention mechanism to highlight valuable features in the frequency channels. The windowed frequency feature maps are transformed back by a two-dimensional inverse fast Fourier transform, and the windows are merged to restore the feature map size to [B, C, H, W]. Finally, the feature attention maps are generated using sigmoid and multiplied with the original feature maps to enhance the crack features.
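The windowing, Fourier transform, and complex-valued weighting steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the module itself: the complex convolution is reduced to a pointwise complex multiplication per frequency, written out with separate real and imaginary parts so that (X1r + i·X1i)(Kr + i·Ki) = (X1r·Kr − X1i·Ki) + i(X1r·Ki + X1i·Kr); the channel attention and the final sigmoid gating are omitted; taking the real part after the inverse transform is an assumption. All function names are hypothetical.

```python
import numpy as np

def window_partition(x, n):
    """Split (B, C, H, W) into non-overlapping n x n windows: (B, C, H//n, W//n, n, n)."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h // n, n, w // n, n).transpose(0, 1, 2, 4, 3, 5)

def window_merge(windows):
    """Inverse of window_partition: back to (B, C, H, W)."""
    b, c, gh, gw, n, _ = windows.shape
    return windows.transpose(0, 1, 2, 4, 3, 5).reshape(b, c, gh * n, gw * n)

def complex_weighting(xr, xi, kr, ki):
    """Complex product (xr + i*xi) * (kr + i*ki) using real arithmetic only."""
    return xr * kr - xi * ki, xr * ki + xi * kr

def frequency_aggregate(x, kr, ki, n):
    """Window the input, weight its spectrum with a complex kernel, and transform back."""
    windows = window_partition(x, n)
    spectrum = np.fft.fft2(windows, axes=(-2, -1))        # per-window 2D FFT
    yr, yi = complex_weighting(spectrum.real, spectrum.imag, kr, ki)
    restored = np.fft.ifft2(yr + 1j * yi, axes=(-2, -1)).real
    return window_merge(restored)

rng = np.random.default_rng(0)
features = rng.random((2, 3, 8, 8))                       # [B, C, H, W]
kernel_real = rng.standard_normal((4, 4))                 # one weight per frequency (assumed)
kernel_imag = rng.standard_normal((4, 4))
aggregated = frequency_aggregate(features, kernel_real, kernel_imag, n=4)
```

With a kernel whose real part is all ones and whose imaginary part is all zeros, the function returns its input unchanged, which is a useful sanity check that the windowing and the forward and inverse transforms are consistent.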

Attention feature refinement module
Roadway cracks usually present different shapes, including linear, reticulated, and tortoise-shell cracks. Linear cracks are usually elongated cracks extending along the direction of the roadway. In contrast, reticulated cracks form a network of intersecting cracks, and tortoise-shell cracks are tiny cracks whose pattern resembles a tortoise shell. As the network deepens, rich semantic features of cracks can be extracted from the backbone network, which is crucial for distinguishing cracks from the background. However, as the localized features of the cracks are extracted, the spatial dimension of the feature map keeps shrinking, leading to the loss of spatial correlation information.
Therefore, this paper introduces the Attention Feature Refinement Module (AFRM), shown in Fig. 5, which aims to alleviate the semantic differences between the feature maps extracted by the backbone network, enhance their spatial relevance, and reduce redundant features. First, the feature maps generated by the backbone network are spliced, and a multilayer perceptron (MLP) generates an attention map that increases the weights of the useful low-level features. This attention map is then multiplied with the feature maps enhanced by the frequency features, and the result is the input to the AFRM. The AFRM consists of a spatial branch and a channel branch. The spatial branch first generates the spatial feature maps of the cracks using deep convolution, then extracts the inter-pixel correlations by a pixel blending operation and generates the attention maps with a Sigmoid function. The channel branch first performs …

In the evaluation metrics, TP represents the count of true positives, FP denotes the count of false positives, and FN denotes the count of false negatives. The per-image pixel accuracy is aggregated and averaged to yield the average pixel accuracy across the entire dataset. Pi represents the accuracy of the identified pixel points corresponding to pavement cracks in the image, determined by comparing them with the pixels in the ground truth labels. Additionally, this paper uses floating point operation counts (FLOPs) and Params to gauge the computational complexity of the network.
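To make the role of TP, FP, and FN concrete, the following sketch computes two common pixel-level metrics for a binary crack mask. It assumes the standard definitions Precision = TP / (TP + FP) and IoU = TP / (TP + FP + FN), which may differ in detail from the formulas used in the experiments; `crack_metrics` is a hypothetical helper.

```python
import numpy as np

def crack_metrics(pred, target):
    """Return (precision, iou) for the crack class of two binary masks.

    Assumes the standard definitions: precision = TP / (TP + FP),
    IoU = TP / (TP + FP + FN). Empty denominators yield 0.0.
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))      # crack pixels predicted as crack
    fp = int(np.sum(pred & ~target))     # background pixels predicted as crack
    fn = int(np.sum(~pred & target))     # crack pixels that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, iou

prediction = np.array([[1, 1], [0, 0]])
ground_truth = np.array([[1, 0], [1, 0]])
precision, iou = crack_metrics(prediction, ground_truth)
```

In this toy case there is one true positive, one false positive, and one false negative, so the precision is 1/2 and the IoU is 1/3. Averaging such per-image values over a dataset gives dataset-level scores.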

Model performance evaluation
In this section, the proposed road crack semantic segmentation model is validated. It is compared with the current popular semantic segmentation models Unet 21, ResUnet 22, DeepLabV3 23, and AttentionUnet 24, and the experimental results are shown in Table 1. Figure 6 shows the performance of each model on the self-built dataset, Crack Tree (CT) 25, and Crack Forest Dataset (CFD) 26.
As can be seen from Table 1, the model proposed in this paper outperforms the other models on each dataset, achieving a balance between performance and parameters. The proposed model has only 12.836 M parameters and 6.846 G FLOPs, which indicates that it can characterize more cracks with relatively few parameters. Although DeepLabV3+ has fewer parameters, its segmentation performance is lower than that of the proposed model. In terms of segmentation accuracy, the proposed model reaches 85.99%, 77.15%, and 83.07% on the three datasets, respectively. Compared with Unet and ResUnet, the proposed model improves significantly, leading by 11.57% and 10.4% on the self-built dataset, respectively. In DeepLabV3+, atrous convolution dramatically reduces the number of parameters but also lowers the segmentation performance. AttentionUnet adds multiple attention mechanisms to Unet, resulting in the highest accuracy, but this also increases the computational effort of the model and makes it challenging to train. In contrast, the model proposed in this paper is much less complicated to train while not sacrificing too much accuracy.
In a comprehensive comparison, the model proposed in this paper is ahead of the current popular models in terms of performance. This is because the proposed model is built from the lightweight CSConv, which significantly reduces the parameters and computation of the model while maintaining good feature extraction capability. The added frequency feature aggregation and attention feature refinement modules further improve the segmentation accuracy of the model.
Figure 7 shows the visualization results of each model's segmentation. Column 1 is the original image, column 2 is the image annotation, and columns 3 to 7 are our model, the DeepLabV3+ model, the Unet model, the ResUnet model, and the AttentionUnet model, respectively. Five typical images are extracted from each dataset. Comparing the segmentation results shows that the model in this paper generalizes more robustly and segments pavement cracks with complex backgrounds more accurately. DeepLabV3+ shows more severe under-segmentation when facing complex cracks, especially at the interface of cracks; this is because DeepLabV3+ does not consider the correlation between the spatial information and the receptive field and therefore cannot take full advantage of the shallow field. Unet and ResUnet can only obtain local context information through standard convolution, so the global features of the cracks are not well preserved. AttentionUnet uses multiple attention mechanisms to model the spatial and channel contexts and obtains good segmentation, but its under-segmentation is serious when dealing with complex cracks. In summary, the model in this paper has better segmentation performance than the other models for both simple and complex cracks.

Backbone network performance comparison
This section evaluates the proposed model's feature extraction capability against popular backbone networks. These include ShuffleNetV2 27, MobileNetV3 28, GhostNet 29, and EfficientNetV2 30, which are backbone networks based on convolutional neural network (CNN) architectures, and MobileViT 31, which is a backbone network based on the Transformer architecture. The results are shown in Table 2.
According to Table 2, the model proposed in this paper is superior in crack feature extraction capability to the other popular backbone networks. This is due to CSConv's modeling of the local features of road cracks, which strengthens the connection between feature channels and the spatial connection between crack pixels. The convolutional block proposed in this paper is based on the CNN architecture. Compared with other CNN-based backbone networks, it compensates for the weakness of standard convolution in local spatial feature extraction through the spatial displacement operation, making the backbone network more sensitive to the spatial features of road cracks and able to extract their detailed features more finely. Compared with MobileViT, the FLOPs of our model increase by only 0.696 G, while the mIoU improves by 2.13%. A fixed-size convolutional kernel cannot efficiently capture features of different scales and sizes and has a limited receptive field. In contrast, the dynamic convolutional kernel of CSConv can pay targeted attention to the crack features, which improves the segmentation accuracy.

Grad-CAM visualization results
To analyze the regional weights of road crack images and to compare how these weights change across models, this paper introduces gradient-weighted class activation mapping (Grad-CAM) to generate a saliency map for each convolutional layer, highlighting the critical regions that influence the prediction for crack images. As shown in Fig. 8, ResNet50, ShuffleNet, MobileNet, and our model are selected for heat map comparison, showing each layer's crack heat map after feature extraction.
As seen in Fig. 8, these four models differ in their attention to cracks and in how they handle carved grooves. Specifically, ResNet50 pays more attention to cracks but handles carved grooves poorly, mistaking them for cracks and producing errors. ShuffleNet overcomes the effect of carved grooves but pays insufficient attention to cracks and has defects in recognizing crack features in the global region. MobileNet focuses on the global region but has difficulty distinguishing carved grooves from cracks, and part of the crack features is lost as the network depth increases, resulting in poor segmentation. In contrast, the model proposed in this paper focuses more on the crack region and ignores the invalid region, reflecting the powerful feature extraction capability of CSConv. CSConv effectively fuses the spatial and channel features of the cracks and enhances the features of the valid region, thus improving the segmentation capability.
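The Grad-CAM heat maps discussed above follow a simple computation once the activations of a layer and the gradients of the class score with respect to those activations are available. The sketch below assumes both are already extracted as arrays (how they are obtained depends on the deep learning framework and is not shown); `grad_cam` is a hypothetical helper, and the normalization to [0, 1] is a common display convention rather than part of the method.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map for one layer.

    activations: (K, H, W) feature maps of the layer
    gradients:   (K, H, W) gradients of the class score w.r.t. those feature maps
    """
    weights = gradients.mean(axis=(1, 2))                     # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=1)          # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                                # keep only positive evidence (ReLU)
    peak = cam.max()
    return cam / peak if peak > 0 else cam                    # scale to [0, 1] for display

acts = np.array([[[1.0, 1.0], [1.0, 1.0]],
                 [[1.0, 0.0], [0.0, 0.0]]])
grads = np.array([[[1.0, 1.0], [1.0, 1.0]],
                  [[-2.0, -2.0], [-2.0, -2.0]]])
heat_map = grad_cam(acts, grads)
```

In this toy example the second channel has a negative weight, so the position where it is active is suppressed in the heat map while the remaining positions stay highlighted.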

Ablation experiments
To verify the effect of the different modules on segmentation ability, ResNet50 is used as the baseline, to which CSConv, the frequency feature aggregation module, and the attention feature refinement module are added step by step, with training on the self-built dataset. The experimental results are shown in Table 3. CSConv remedies the shortcomings of standard convolution in spatial modeling and mainly contributes to the average pixel accuracy of the model. The spatial shift operation aggregates nearest-neighbor features at the corresponding positions, which improves the average pixel accuracy of the model by 9.3% without adding many parameters, because the shift operation is parameter-free. FFAM enhances the edge contrast of the cracks in the frequency domain, and the windowed patches enhance the local details of the features, contributing 4.95% mean intersection over union to the model. AFRM reduces the semantic gaps between the underlying feature maps, which makes the model pay more attention to the details of the road cracks and contributes 3.37% Precision to the model.

Conclusion
In this paper, we pinpoint the limited performance of existing semantic segmentation models in road crack detection tasks and propose a semantic segmentation model for road cracks that combines channel-space convolution with frequency feature aggregation. By introducing a new convolutional block, the pixel-space relationship can be modeled more effectively in connection with the channel features to identify crack pixels accurately. In addition, to enhance the contrast of crack edges, a frequency-domain feature aggregation module is proposed, which solves the mismatch of frequency-domain inputs and efficiently models the deeper frequency features by taking into account the effect of the imaginary part of the frequency on the features. Finally, a feature refinement module is designed to refine the semantic features and further improve the segmentation accuracy. The experimental results show that the proposed model outperforms the current popular general-purpose models on the road crack segmentation task and has superior performance and application potential.

Discussion
We use a number of comparative experiments in the experimental section to demonstrate that the proposed model is fast and accurate for road crack detection, but we also want the proposed model to be robust in the face of more complex road cracks. Therefore, in this chapter, we examine the robustness of the proposed model under bad weather and on data from other modalities. Table 4 records the training results of the model. The model performance decreases dramatically on infrared spectral data, where it cannot accurately detect even simple cracks. The model's robustness in rainy-day pavement crack detection is slightly better than on infrared spectral data, but the segmentation accuracy is still insufficient.
Exploring the reasons for the performance degradation, we suggest that, first, no standardized dataset of road cracks in other spectra or severe weather currently exists. The boundaries of road cracks in these cases have low contrast with normal pavement and are strongly affected by noise. Second, the proposed model is not yet capable of handling extreme and complex road surface conditions such as infrared spectra or rainy weather. In future work, we will improve the model to address these shortcomings and strive to achieve accurate detection of road cracks in all weather conditions.

Figure 1. Overview of the model.

Figure 7. Visualization of model segmentation results.

Figure 8. Visualization of model segmentation results.

Table 1. Comparison of training results of different models on each dataset.

Table 2. Backbone network performance comparison.
Overall, the backbone network of the model in this paper accounts for only 29.2% of the number of parameters and 29.7% of the FLOPs of ResNet50. Nevertheless, each module improves the model's segmentation capability.

Table 4. Validation of model performance on Infrared Spectrum and Rainy weather data.